Load packages:
library(here) # A Simpler Way to Find Your Files, CRAN
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN
library(dada2) # Accurate, high-resolution sample inference from amplicon data, Bioconductor
library(plyr) # Tools for Splitting, Applying and Combining Data, CRAN
library(DT) # A Wrapper of the JavaScript Library 'DataTables', CRAN
library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN
library(biomformat) # An interface package for the BIOM file format, Bioconductor
# Set seed
set.seed(1910)
Set the path to the fastq files:
path <- here::here("data/raw/casava-18-paired-end-demultiplexed-run1")
head(list.files(path))
## [1] "AqFl1-002_S1_L001_R1_001.fastq.gz" "AqFl1-002_S1_L001_R2_001.fastq.gz"
## [3] "AqFl1-003_S10_L001_R1_001.fastq.gz" "AqFl1-003_S10_L001_R2_001.fastq.gz"
## [5] "AqFl1-005_S19_L001_R1_001.fastq.gz" "AqFl1-005_S19_L001_R2_001.fastq.gz"
Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.
# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq.gz and SAMPLENAME_R2_001.fastq.gz
fnFs <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "AqFl1-002" "AqFl1-003" "AqFl1-005" "AqFl1-007" "AqFl1-009" "AqFl1-011"
We start by visualizing the quality profiles of the forward reads:
plotQualityProfile(fnFs[1:2]) +
scale_x_continuous(limits = c(0, 300), breaks = 0:6*50)
In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quantiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).
The forward reads are good quality. We will truncate the forward reads at position 290.
Now we visualize the quality profile of the reverse reads:
plotQualityProfile(fnRs[1:2]) +
scale_x_continuous(limits = c(0, 300), breaks = 0:6*50)
The reverse reads are of noticeably worse quality, which is common in Illumina sequencing. In addition, the average read length of the reverse reads is only about 275 nt because the run was stopped earlier than planned due to a lack of disk space on the MiSeq. This isn’t too worrisome, as the low-quality 3’-end sequence will be truncated anyway. Based on these profiles, we will truncate the reverse reads at position 238, where the quality distribution crashes.
Considerations: the reads must still overlap after truncation in order to merge them later! For less-overlapping primer sets, like the V1-V2 we used for the present study, the truncLen must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between them.
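As a quick back-of-the-envelope check, we can verify that the chosen truncation lengths leave ample overlap. Note that filterAndTrim produces filtered reads of length truncLen minus trimLeft, and the ~330 bp primer-trimmed amplicon length used below is an assumption for illustration:

```r
# Back-of-the-envelope overlap check (the ~330 bp primer-trimmed amplicon
# length for this V1-V2 region is an assumption for illustration).
fwd_len <- 290 - 20   # filtered forward reads: truncLen - trimLeft = 270 nt
rev_len <- 238 - 18   # filtered reverse reads: truncLen - trimLeft = 220 nt
amplicon <- 330       # assumed primer-trimmed amplicon length (bp)
overlap <- fwd_len + rev_len - amplicon
overlap  # 160 nt of overlap, comfortably above the ~20 nt minimum
```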
Assign filenames for the filtered fastq files.
# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
We’ll use standard filtering parameters: maxN = 0 (DADA2 requires no Ns), truncQ = 2, rm.phix = TRUE and maxEE = 2. The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores. We’ll also trim off the primer sequence from the forward and reverse reads by setting trimLeft = c(20, 18). Trimming and filtering is performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(20, 18),
truncLen = c(290,238), maxN = 0, maxEE = c(2,2), truncQ = 2,
rm.phix = TRUE, compress = TRUE, multithread = TRUE)
head(out)
## reads.in reads.out
## AqFl1-002_S1_L001_R1_001.fastq.gz 146844 68210
## AqFl1-003_S10_L001_R1_001.fastq.gz 173709 90236
## AqFl1-005_S19_L001_R1_001.fastq.gz 197511 104402
## AqFl1-007_S28_L001_R1_001.fastq.gz 148087 61686
## AqFl1-009_S37_L001_R1_001.fastq.gz 156341 77440
## AqFl1-011_S46_L001_R1_001.fastq.gz 200487 100743
Considerations: The standard filtering parameters are starting points, not set in stone. If speeding up downstream computation is needed, consider tightening maxEE. If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (e.g. maxEE = c(2,5)), and reducing the truncLen to remove low-quality tails. Remember though, when choosing truncLen for paired-end reads we must maintain overlap after truncation in order to merge them later.
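To make the maxEE filter concrete: the expected number of errors in a read is the sum of the per-base error probabilities implied by the Phred quality scores, EE = sum(10^(-Q/10)). A minimal sketch with made-up quality scores:

```r
# Expected errors for a hypothetical 5-base read with Phred scores Q.
# Each base contributes an error probability of 10^(-Q/10).
q  <- c(40, 40, 30, 20, 10)
ee <- sum(10^(-q / 10))
ee  # 0.1112 -- this read would easily pass maxEE = 2
```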
The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
errF <- learnErrors(filtFs, multithread = TRUE)
## 108532980 total bases in 401974 reads from 5 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread = TRUE)
## 110597740 total bases in 502717 reads from 6 samples will be used for learning the error rates.
It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:
plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ = TRUE)
The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.
We are now ready to apply the core sample inference algorithm to the data. By default, the dada function processes each sample independently, which removes singletons in each sample. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. The dada2 package offers two types of pooling. dada(..., pool = TRUE) performs standard pooled processing, in which all samples are pooled together for sample inference. dada(..., pool = "pseudo") performs pseudo-pooling, in which samples are processed independently after sharing information between samples, approximating pooled sample inference in linear time. Here, we use the default method (pool = FALSE).
dadaFs <- dada(filtFs, err = errF, pool = FALSE, multithread = TRUE)
## Sample 1 - 68210 reads in 15190 unique sequences.
## Sample 2 - 90236 reads in 16228 unique sequences.
## Sample 3 - 104402 reads in 14998 unique sequences.
## Sample 4 - 61686 reads in 10183 unique sequences.
## Sample 5 - 77440 reads in 18627 unique sequences.
## Sample 6 - 100743 reads in 19610 unique sequences.
## Sample 7 - 72837 reads in 16445 unique sequences.
## Sample 8 - 81454 reads in 18361 unique sequences.
## Sample 9 - 83126 reads in 11814 unique sequences.
## Sample 10 - 115127 reads in 24263 unique sequences.
## Sample 11 - 96689 reads in 15537 unique sequences.
## Sample 12 - 67676 reads in 13248 unique sequences.
## Sample 13 - 86248 reads in 17645 unique sequences.
## Sample 14 - 126941 reads in 20728 unique sequences.
## Sample 15 - 74213 reads in 13389 unique sequences.
## Sample 16 - 101823 reads in 19090 unique sequences.
## Sample 17 - 65493 reads in 10176 unique sequences.
## Sample 18 - 109522 reads in 22256 unique sequences.
## Sample 19 - 105175 reads in 20208 unique sequences.
## Sample 20 - 63943 reads in 8331 unique sequences.
## Sample 21 - 125870 reads in 20392 unique sequences.
## Sample 22 - 124835 reads in 23601 unique sequences.
## Sample 23 - 119248 reads in 22040 unique sequences.
## Sample 24 - 90226 reads in 17577 unique sequences.
## Sample 25 - 88013 reads in 18572 unique sequences.
## Sample 26 - 152678 reads in 27693 unique sequences.
## Sample 27 - 66530 reads in 13143 unique sequences.
## Sample 28 - 90660 reads in 18003 unique sequences.
## Sample 29 - 63571 reads in 14383 unique sequences.
## Sample 30 - 109060 reads in 21401 unique sequences.
## Sample 31 - 24713 reads in 8645 unique sequences.
## Sample 32 - 78956 reads in 14619 unique sequences.
## Sample 33 - 93207 reads in 16956 unique sequences.
## Sample 34 - 97949 reads in 13535 unique sequences.
## Sample 35 - 84490 reads in 11728 unique sequences.
## Sample 36 - 106303 reads in 19739 unique sequences.
## Sample 37 - 97202 reads in 23064 unique sequences.
## Sample 38 - 103305 reads in 14314 unique sequences.
## Sample 39 - 59365 reads in 11475 unique sequences.
## Sample 40 - 75997 reads in 16775 unique sequences.
## Sample 41 - 100446 reads in 19208 unique sequences.
## Sample 42 - 143002 reads in 21442 unique sequences.
## Sample 43 - 189743 reads in 35563 unique sequences.
## Sample 44 - 57452 reads in 11917 unique sequences.
## Sample 45 - 125168 reads in 20083 unique sequences.
## Sample 46 - 67528 reads in 12606 unique sequences.
## Sample 47 - 65127 reads in 13071 unique sequences.
## Sample 48 - 76106 reads in 7228 unique sequences.
## Sample 49 - 86327 reads in 7095 unique sequences.
## Sample 50 - 108601 reads in 8243 unique sequences.
## Sample 51 - 106934 reads in 9165 unique sequences.
## Sample 52 - 56647 reads in 5334 unique sequences.
## Sample 53 - 122246 reads in 9197 unique sequences.
## Sample 54 - 131381 reads in 10551 unique sequences.
## Sample 55 - 76817 reads in 7341 unique sequences.
## Sample 56 - 102788 reads in 8532 unique sequences.
## Sample 57 - 63760 reads in 5482 unique sequences.
## Sample 58 - 50493 reads in 5441 unique sequences.
## Sample 59 - 83276 reads in 7227 unique sequences.
## Sample 60 - 94762 reads in 17717 unique sequences.
## Sample 61 - 121524 reads in 17731 unique sequences.
## Sample 62 - 104731 reads in 14851 unique sequences.
## Sample 63 - 91763 reads in 7105 unique sequences.
## Sample 64 - 68434 reads in 6068 unique sequences.
## Sample 65 - 40712 reads in 3905 unique sequences.
dadaRs <- dada(filtRs, err = errR, pool = FALSE, multithread = TRUE)
## Sample 1 - 68210 reads in 23001 unique sequences.
## Sample 2 - 90236 reads in 26343 unique sequences.
## Sample 3 - 104402 reads in 24820 unique sequences.
## Sample 4 - 61686 reads in 20631 unique sequences.
## Sample 5 - 77440 reads in 25530 unique sequences.
## Sample 6 - 100743 reads in 31480 unique sequences.
## Sample 7 - 72837 reads in 25748 unique sequences.
## Sample 8 - 81454 reads in 24397 unique sequences.
## Sample 9 - 83126 reads in 23958 unique sequences.
## Sample 10 - 115127 reads in 33879 unique sequences.
## Sample 11 - 96689 reads in 25600 unique sequences.
## Sample 12 - 67676 reads in 24568 unique sequences.
## Sample 13 - 86248 reads in 27745 unique sequences.
## Sample 14 - 126941 reads in 36739 unique sequences.
## Sample 15 - 74213 reads in 23825 unique sequences.
## Sample 16 - 101823 reads in 29683 unique sequences.
## Sample 17 - 65493 reads in 20964 unique sequences.
## Sample 18 - 109522 reads in 32666 unique sequences.
## Sample 19 - 105175 reads in 30002 unique sequences.
## Sample 20 - 63943 reads in 20796 unique sequences.
## Sample 21 - 125870 reads in 37304 unique sequences.
## Sample 22 - 124835 reads in 37174 unique sequences.
## Sample 23 - 119248 reads in 38122 unique sequences.
## Sample 24 - 90226 reads in 27705 unique sequences.
## Sample 25 - 88013 reads in 27519 unique sequences.
## Sample 26 - 152678 reads in 42384 unique sequences.
## Sample 27 - 66530 reads in 19216 unique sequences.
## Sample 28 - 90660 reads in 33290 unique sequences.
## Sample 29 - 63571 reads in 24311 unique sequences.
## Sample 30 - 109060 reads in 32936 unique sequences.
## Sample 31 - 24713 reads in 13136 unique sequences.
## Sample 32 - 78956 reads in 22522 unique sequences.
## Sample 33 - 93207 reads in 28133 unique sequences.
## Sample 34 - 97949 reads in 23470 unique sequences.
## Sample 35 - 84490 reads in 18883 unique sequences.
## Sample 36 - 106303 reads in 33845 unique sequences.
## Sample 37 - 97202 reads in 33231 unique sequences.
## Sample 38 - 103305 reads in 25782 unique sequences.
## Sample 39 - 59365 reads in 20840 unique sequences.
## Sample 40 - 75997 reads in 22304 unique sequences.
## Sample 41 - 100446 reads in 34398 unique sequences.
## Sample 42 - 143002 reads in 37805 unique sequences.
## Sample 43 - 189743 reads in 49918 unique sequences.
## Sample 44 - 57452 reads in 20864 unique sequences.
## Sample 45 - 125168 reads in 35693 unique sequences.
## Sample 46 - 67528 reads in 20981 unique sequences.
## Sample 47 - 65127 reads in 22706 unique sequences.
## Sample 48 - 76106 reads in 13681 unique sequences.
## Sample 49 - 86327 reads in 15302 unique sequences.
## Sample 50 - 108601 reads in 16366 unique sequences.
## Sample 51 - 106934 reads in 16665 unique sequences.
## Sample 52 - 56647 reads in 12397 unique sequences.
## Sample 53 - 122246 reads in 19611 unique sequences.
## Sample 54 - 131381 reads in 20460 unique sequences.
## Sample 55 - 76817 reads in 14381 unique sequences.
## Sample 56 - 102788 reads in 15280 unique sequences.
## Sample 57 - 63760 reads in 12679 unique sequences.
## Sample 58 - 50493 reads in 10598 unique sequences.
## Sample 59 - 83276 reads in 15460 unique sequences.
## Sample 60 - 94762 reads in 29897 unique sequences.
## Sample 61 - 121524 reads in 28307 unique sequences.
## Sample 62 - 104731 reads in 24227 unique sequences.
## Sample 63 - 91763 reads in 16770 unique sequences.
## Sample 64 - 68434 reads in 11794 unique sequences.
## Sample 65 - 40712 reads in 8275 unique sequences.
Inspect the dada-class object:
dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 498 sequence variants were inferred from 15190 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
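For reference, the pooled and pseudo-pooled alternatives discussed above would be invoked as follows (illustrative, not run here; this analysis uses the default pool = FALSE):

```r
# Standard pooled inference: all samples are pooled for sample inference.
dadaFs_pooled <- dada(filtFs, err = errF, pool = TRUE, multithread = TRUE)

# Pseudo-pooling: samples are processed independently after sharing
# information between samples, approximating pooled inference in linear time.
dadaFs_pseudo <- dada(filtFs, err = errF, pool = "pseudo", multithread = TRUE)
```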
We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
## sequence
## 1 GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGGAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
## 2 GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGAAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
## 3 GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGCTGCTTTTTCCACCGAAGCTTGCTTCACCGGAAAAAGCGGAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCTTCAGCGGGGGATAACACTTGGAAACAGGTGCTAATACCGCATAGGATTTCTGTTCGCATGAACGGAGAAGGAAAGACGGCGTAAGCTGTCACTGAAGGATGGACCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAATGATGCATAGCCGACCTGAGAGGGTAATCGGCCACACTGGGACTGAGACACGGCCCAG
## 4 GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAGTGAAATTAACTGATCCCTTCGGGGTGACGTTAATGGATCTAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTGCCTGTAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAACACTTCGAACCACATGGTTTGAAGATGAAAGGCGGCGCAAGCTGTCACTTACAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
## 5 GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACGGATGGGAGCTTGCTCCCAGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTGCCCTATAGTTGGGGATAACTCCGGGAAACCGGGGCTAATACCGAATGATATAATTTAGCTCCTGCTAGATTGTTGAAAGATGGTTTACGCTATCGCTATAGGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
## 6 GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGCGGGAATCTCTTCTGATCCCTTCGGGGTGAAGAGAGTGGAACGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAGTAAGAAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGACCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 4201 1 2 158 0 0 1 TRUE
## 2 2942 2 1 158 0 0 2 TRUE
## 3 2876 3 4 162 0 0 1 TRUE
## 4 2543 4 3 160 0 0 2 TRUE
## 5 2176 5 5 174 0 0 1 TRUE
## 6 2033 6 7 158 0 0 1 TRUE
Considerations: Most of the reads should successfully merge. If that is not the case, upstream parameters may need to be revisited: Did we trim away the overlap between the reads?
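If too few pairs merged, the overlap requirements can be relaxed via mergePairs arguments (the values below are illustrative and not used in this analysis):

```r
# Relax the merge criteria: accept shorter overlaps and allow one mismatch
# in the overlap region (defaults are minOverlap = 12, maxMismatch = 0).
mergers_relaxed <- mergePairs(dadaFs, filtFs, dadaRs, filtRs,
                              minOverlap = 10, maxMismatch = 1, verbose = TRUE)
```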
We can now construct an amplicon sequence variant (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.
seqtab <- makeSequenceTable(mergers)
The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.
dim(seqtab)
## [1] 65 7878
Inspect the distribution of sequence lengths:
table(nchar(getSequences(seqtab))) %>%
as.data.frame() %>%
rename("sequence length" = Var1) %>%
datatable(options = list(
columnDefs = list(list(className = 'dt-left', targets = c(0:2)))
))
Plot the sequence length distribution:
seqLen <- nchar(getSequences(seqtab)) %>%
as.data.frame() %>%
rename(seqLen = ".")
ggplot(seqLen, aes(x = seqLen)) +
geom_histogram(binwidth = 1, alpha = 0.2, position = "identity", colour = "red") +
geom_vline(aes(xintercept = mean(seqLen)), color = "blue", linetype = "dashed", size = 0.5) +
scale_x_continuous(breaks = seq(round_any(min(seqLen), 10),
round_any(max(seqLen), 10, f = ceiling),
round_any((max(seqLen)-min(seqLen))/10, 10, f = ceiling))) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)),
breaks = seq(0,
round_any(max(table(seqLen)), 10, f = ceiling),
round_any(max(table(seqLen))/10, 10, f = ceiling))) +
labs(x = "sequence length (bp)", title = "Amplicon length distribution") +
annotate("text", label = "mean length", x = mean(seqLen$seqLen)-2, y = max(table(seqLen)), hjust = 1, colour = "blue") +
theme_bw()
Considerations: Sequences that are much longer or shorter than expected may be the result of non-specific priming. The sequence lengths fall within the range of the expected amplicon sizes. We’ll just leave them as they are.
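Had off-target lengths been present, they could be removed by subsetting the sequence table to an expected length window. This step is not applied here, and the 300–340 bp window below is an assumption for illustration:

```r
# Keep only ASVs whose length falls within the assumed expected range
# (illustrative only; not applied in this analysis).
seqtab_inrange <- seqtab[, nchar(getSequences(seqtab)) %in% 300:340]
```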
The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences. As we used pool = FALSE during the sample inference, we should use method = "consensus" for the chimera removal.
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE, verbose = TRUE)
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.972694
dim(seqtab.nochim)
## [1] 65 5384
The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on factors including experimental procedures and sample complexity.
Considerations: Most of the reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of the reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:
getN <- function(x) sum(getUniques(x))
stats <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(stats) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(stats) <- sample.names
datatable(stats)
Plot the sequences stats:
p_stats <- stats %>%
as.data.frame() %>%
rownames_to_column("SampleID") %>%
mutate_at(vars("filtered":"nonchim"), ~100*.x/input) %>%
mutate(input = 100) %>%
gather(key = "step", value = "percent", -SampleID) %>%
mutate(step = factor(step, levels = c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim"))) %>%
ggplot(aes(x = step, y = percent, color = SampleID)) +
geom_point() +
geom_line(aes(group = SampleID)) +
scale_y_continuous(breaks = 0:10*10) +
labs(x = "", y = "Reads retained (%)") +
theme_bw()
ggplotly(p_stats, tooltip = c("x", "y", "colour"))
Looks good! Aside from the filtering step, we retained the majority of our raw reads at each stage. One sample lost about 80% of its reads during filtering; keep an eye on this sample in the downstream data analysis.
Considerations: This is a great place to do a last sanity check. Outside of filtering, there should be no step in which a majority of reads are lost. If a majority of reads failed to merge, one may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span the amplicon. If a majority of reads were removed as chimeric, one may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.
Export feature table:
t(seqtab.nochim) %>%
make_biom() %>%
write_biom(here::here("data/intermediate/dada2/table-run1.biom"))
Export representative sequences:
uniquesToFasta(seqtab.nochim, fout = here::here("data/intermediate/dada2/rep-seqs-run1.fna"),
ids = colnames(seqtab.nochim))
The processing of raw sequence data into an ASV table is based on the online DADA2 tutorial (1.16). For more documentation and tutorials, visit the DADA2 website.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=nb_NO.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=nb_NO.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=nb_NO.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biomformat_1.20.0 plotly_4.9.4.1 DT_0.18 plyr_1.8.6
## [5] dada2_1.20.0 Rcpp_1.0.7 forcats_0.5.1 stringr_1.4.0
## [9] dplyr_1.0.7 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
## [13] tibble_3.1.2 ggplot2_3.3.5 tidyverse_1.3.1 here_1.0.1
## [17] knitr_1.33
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 hwriter_1.3.2
## [3] ellipsis_0.3.2 rprojroot_2.0.2
## [5] htmlTable_2.2.1 XVector_0.32.0
## [7] GenomicRanges_1.44.0 base64enc_0.1-3
## [9] fs_1.5.0 rstudioapi_0.13
## [11] farver_2.1.0 fansi_0.5.0
## [13] lubridate_1.7.10 xml2_1.3.2
## [15] codetools_0.2-18 splines_4.1.1
## [17] Formula_1.2-4 jsonlite_1.7.2
## [19] Rsamtools_2.8.0 broom_0.7.6
## [21] cluster_2.1.2 dbplyr_2.1.1
## [23] png_0.1-7 compiler_4.1.1
## [25] httr_1.4.2 backports_1.2.1
## [27] assertthat_0.2.1 Matrix_1.3-4
## [29] lazyeval_0.2.2 cli_3.0.0
## [31] htmltools_0.5.1.1 tools_4.1.1
## [33] gtable_0.3.0 glue_1.4.2
## [35] GenomeInfoDbData_1.2.6 reshape2_1.4.4
## [37] ShortRead_1.50.0 Biobase_2.52.0
## [39] cellranger_1.1.0 jquerylib_0.1.4
## [41] rhdf5filters_1.4.0 vctrs_0.3.8
## [43] Biostrings_2.60.1 crosstalk_1.1.1
## [45] xfun_0.24 rvest_1.0.0
## [47] lifecycle_1.0.0 zlibbioc_1.38.0
## [49] scales_1.1.1 hms_1.0.0
## [51] MatrixGenerics_1.4.0 parallel_4.1.1
## [53] SummarizedExperiment_1.22.0 rhdf5_2.36.0
## [55] RColorBrewer_1.1-2 yaml_2.2.1
## [57] gridExtra_2.3 sass_0.4.0
## [59] rpart_4.1-15 latticeExtra_0.6-29
## [61] stringi_1.6.2 highr_0.9
## [63] S4Vectors_0.30.0 checkmate_2.0.0
## [65] BiocGenerics_0.38.0 BiocParallel_1.26.1
## [67] GenomeInfoDb_1.28.1 rlang_0.4.11
## [69] pkgconfig_2.0.3 bitops_1.0-7
## [71] matrixStats_0.59.0 evaluate_0.14
## [73] lattice_0.20-45 Rhdf5lib_1.14.2
## [75] labeling_0.4.2 GenomicAlignments_1.28.0
## [77] htmlwidgets_1.5.3 tidyselect_1.1.1
## [79] magrittr_2.0.1 R6_2.5.0
## [81] IRanges_2.26.0 generics_0.1.0
## [83] Hmisc_4.5-0 DelayedArray_0.18.0
## [85] DBI_1.1.1 pillar_1.6.1
## [87] haven_2.4.1 foreign_0.8-81
## [89] withr_2.4.2 survival_3.2-13
## [91] RCurl_1.98-1.3 nnet_7.3-16
## [93] modelr_0.1.8 crayon_1.4.1
## [95] utf8_1.2.1 rmarkdown_2.9
## [97] jpeg_0.1-8.1 grid_4.1.1
## [99] readxl_1.3.1 data.table_1.14.0
## [101] reprex_2.0.0 digest_0.6.27
## [103] RcppParallel_5.1.4 stats4_4.1.1
## [105] munsell_0.5.0 viridisLite_0.4.0
## [107] bslib_0.2.5.1